For this project, we found a set of data that listed cases of different infectious dieases in various counties in California, as well as the total population of each county. It included how many males and females were infected per year, as well as the total number of infections per county per year.
The above scatterplot shows the average count of infectious disease among counties in the state of California by sex. Additionally, we have include the total, male and female, count for emphasis. Interesting aspects of the plot include an apparent sinusoidal behavior amongst males, females, and male/female total. Curiously, the infectious diseases appear to occur in waves for both men and women. In general, the implemented trend lines suggest that despite the “up and down” behavior of the data, infectious disease appear to be increasing in number of cases as time increses, with women contracting more overall number of diseases than men. Data such as this could be used to analyze which sex would be in more need of vaccinations, healthcare reform programs, or overall affect value. Ultimately, the total number of cases continues to increase for both men and women combined.
require("jsonlite")
require("RCurl")
require("ggplot2")
require("dplyr")
require(extrafont)
df <- data.frame(fromJSON(getURL(URLencode('skipper.cs.utexas.edu:5001/rest/native/?query="select * from Infectious_Diseases where COUNTY NOT IN (\'California\')"'),httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl', USER='C##cs329e_cmm5627', PASS='orcl_cmm5627', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))
df %>% select(SEX, mean(COUNT), YEAR) %>% ggplot(aes(y = mean(COUNT), x = YEAR, color = SEX)) + ggtitle("Average Count of Infectious Diseases by Sex") + geom_point() + scale_y_continuous() + scale_x_continuous() + coord_cartesian()
Male_df <- df %>% select(SEX, mean(COUNT), YEAR) %>% filter(SEX == "Male")
Female_df <- df %>% select(SEX, mean(COUNT), YEAR) %>% filter(SEX == "Female")
Male_df %>% select(SEX, mean(COUNT), YEAR) %>% ggplot(aes(y = mean(COUNT), x = YEAR, color = "Male")) + ggtitle("Average Count of Infectious Diseases by Sex") + geom_point() + scale_y_continuous() + scale_x_continuous() + coord_cartesian()
Female_df %>% select(SEX, mean(COUNT), YEAR) %>% ggplot(aes(y = mean(COUNT), x = YEAR, color = Female)) + ggtitle("Average Count of Infectious Diseases by Sex") + geom_point() + scale_y_continuous() + scale_x_continuous() + coord_cartesian()
source("../03 R SQL Visualizations/Scatter1.R", echo = TRUE)
##
## > require("jsonlite")
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
##
## > require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
##
## > require("ggplot2")
## Loading required package: ggplot2
##
## > require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## > require(extrafont)
## Loading required package: extrafont
## Registering fonts with R
##
## > df <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from Infectious_Diseases where COUNTY NOT IN (' .... [TRUNCATED]
##
## > df %>% select(SEX, mean(COUNT), YEAR) %>% ggplot(aes(y = mean(COUNT),
## + x = YEAR)) + ggtitle("Average Count of Infectious Diseases by Sex (TOTA ..." ... [TRUNCATED]
source("../03 R SQL Visualizations/Scatter2.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require("ggplot2")
##
## > require("dplyr")
##
## > require(extrafont)
##
## > Male_df <- df %>% select(SEX, mean(COUNT), YEAR) %>%
## + filter(SEX == "Male")
##
## > Male_df %>% select(SEX, mean(COUNT), YEAR) %>% ggplot(aes(y = mean(COUNT),
## + x = YEAR, color = "Male")) + ggtitle("Average Count of Infectious ..." ... [TRUNCATED]
source("../03 R SQL Visualizations/Scatter3.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require("ggplot2")
##
## > require("dplyr")
##
## > require(extrafont)
##
## > Female_df <- df %>% select(SEX, mean(COUNT), YEAR) %>%
## + filter(SEX == "Female")
##
## > Female_df %>% select(SEX, mean(COUNT), YEAR) %>% ggplot(aes(y = mean(COUNT),
## + x = YEAR, color = "Female")) + ggtitle("Average Count of Infecti ..." ... [TRUNCATED]
These crosstabs display how many people per county were infected with some disease, broken down by males and females, as well as the two combined. The county called “California” is the total of all of the counties. Some counties to note are Alameda, where a significantly larger amount of females had diseases versus males, and the same is true for Fresno. Modoc had very few total diseases, as well as Alpine.
require("jsonlite")
require("RCurl")
require("ggplot2")
require("dplyr")
KPI_Very_Low_value = 0
KPI_Low_value = 10
KPI_Medium_value = 100
df <- diseases %>% group_by(SEX, COUNTY) %>% summarize(avg_count = mean(COUNT)) %>% mutate(kpi = avg_count) %>% mutate(kpi = ifelse(kpi <= KPI_Very_Low_value, 'Very Low', ifelse(kpi <= KPI_Low_value, 'Low', ifelse(kpi <= KPI_Medium_value, 'Medium', 'High'))))
ggplot() +
coord_cartesian() +
scale_x_discrete() +
scale_y_discrete() +
labs(title='Number of Diseases per Sex per County') +
labs(x=paste("SEX"), y=paste("County")) +
layer(data=df,
mapping=aes(x=SEX, y=COUNTY, label=round(avg_count, 2)),
stat="identity",
stat_params=list(),
geom="text",
geom_params=list(colour="black", size = 3),
position=position_identity()
) +
layer(data=df,
mapping=aes(x=SEX, y=COUNTY, label=round(avg_count, 2)),
stat="identity",
stat_params=list(),
geom="text",
geom_params=list(colour="black", vjust=4, hjust=-1, size = 3),
position=position_identity()
) +
layer(data=df,
mapping=aes(x=SEX, y=COUNTY, fill=kpi),
stat="identity",
stat_params=list(),
geom="tile",
geom_params=list(alpha=0.50),
position=position_identity()
) + theme(axis.text.y = element_text(face = "plain", size = 8))
source("../01 Data/InfectiousDiseases.R", echo = FALSE)
source("../03 R SQL Visualizations/Crosstab_jnw.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require("ggplot2")
##
## > require("dplyr")
##
## > KPI_Very_Low_value = 0
##
## > KPI_Low_value = 10
##
## > KPI_Medium_value = 100
##
## > df <- diseases %>% group_by(SEX, COUNTY) %>% summarize(avg_count = mean(COUNT)) %>%
## + mutate(kpi = avg_count) %>% mutate(kpi = ifelse(kpi <= KP .... [TRUNCATED]
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_discrete() + labs(title = "Number of Diseases per Sex per County") +
## + labs( .... [TRUNCATED]
These bar charts show each disease in a panel. Each panel shows the count each year from 2001 to 2014. Specifically, each bar within the panel sums up the count for that particular disease spanning all counties of California for that specified year. The reference line is the average count for each disease over the course of 14 years (2001-2014). It is particularly interesting to observe which years had above the average count for each disease. These bar charts are interesting because you can see how the total count changes from year to year within each disease, and you can also see if males or females have the higher count. For example, the data suggests that the Chlamydia count has a positive correlation with time and also that more females tend to have Chlamydia compared to men. I decided to include all diseases so that it is easy to see which disease is affecting Calfornia’s population the most. As you can see, Chlamydia, Gonorrhea, Campylobacteriosis, HIV, Salmonellosis, Early Syphilis, Coccidioidomycosis, Pertussis, Tuberculosis, Giardiasis, Shingellosis have the above 3000 average counts. Certain diseases can easily be filtered to study trends in diseases with lower counts. Here is an example: